Abstract

There are two major approaches to structured classification. One is the family of probabilistic gradient-based methods such as conditional random fields (CRF), which have high accuracy but two drawbacks: slow training, and no support for search-based optimization (which is important in many cases). The other is the family of search-based learning methods such as perceptrons and the margin infused relaxed algorithm (MIRA), which train fast but also have drawbacks: low accuracy, no probabilistic information, and non-convergence in real-world tasks. We propose a novel and “shockingly easy” solution, a search-based probabilistic online learning method (SAPO), to address most of those issues. This method searches the output candidates, derives probabilities, and conducts efficient online learning. We show that this method trains fast, supports search-based optimization, is very easy to implement, and offers top accuracy, probabilities, and theoretical guarantees of convergence. Experiments on well-known tasks show that our method has better accuracy than CRF and almost as fast training speed as perceptron and MIRA. Results also show that SAPO can easily beat the state-of-the-art systems on those highly-competitive tasks, achieving record-breaking accuracies.


Keywords: structured prediction, graphical model, search-based learning, online learning, convergence


1. Introduction 


Structured classification (structured prediction) models are widely used to solve structure-dependent problems in a variety of application domains, including natural language processing, bioinformatics, speech recognition, and computer vision. To solve those problems, many structured classification methods have been developed, most of which fall into two major categories. One is the probabilistic gradient-based learning methods such as conditional random fields (CRF) (Lafferty et al., 2001). The other is the search-based learning methods, such as the margin infused relaxed algorithm (MIRA) (Crammer and Singer, 2003) and structured perceptrons (Collins, 2002). Other related work on structured classification includes maximum margin Markov networks (Taskar et al., 2003) and structured support vector machines (Tsochantaridis et al., 2004).
Probabilistic gradient-based learning methods such as CRF have high accuracy because of the exact calculation of the gradient and the probabilistic information. Nevertheless, those methods have critical drawbacks:

• First, the probabilistic gradient-based methods typically do not support search-based optimization (search-based learning), which is important in structured classification problems with complex structures. In tasks with complex structures, the gradient computation is usually quite complicated and even intractable, mainly because dynamic programming for calculating the gradient is hard to scale to complex structures. The search technique, on the other hand, scales to complex structures more easily. That is why gradient-based methods like CRF are usually applied only to relatively simple structures like sequential tagging, and are rarely used for more complex tasks with or beyond tree structures. Take syntactic parsing with tree structures as an example: instead of CRF, most existing systems are based on perceptrons or MIRA, because they support search-based learning. Search-based learning is much simpler than gradient-based learning: just search the promising output candidates, compare them with the oracle labels, and update the weights accordingly.

• Second, the training of probabilistic gradient-based methods like CRF is computationally expensive and quite slow in practice. The reason is that training the CRF model requires gradient computation for gradient-based optimization (e.g., in stochastic gradient descent training or traditional batch training methods) (Sha and Pereira, 2003; Vishwanathan et al., 2006; Sun, 2014). The gradient computation is costly, especially when the tag set is relatively large.


The other category of structured classification methods is the search-based learning methods, such as structured perceptrons and MIRA. A major advantage of those methods is that they support search-based learning: no gradient is needed, and learning is done by simply searching the promising output candidates, comparing them with the oracle labels, and updating the model weights accordingly. As a by-product of avoiding gradient computation, those methods train fast compared with probabilistic gradient-based learning methods like CRF. However, the existing search-based learning methods also have severe drawbacks:

• First, the existing search-based learning methods like perceptrons and MIRA have relatively low accuracy compared with probabilistic gradient-based learning methods like CRF.

• Second, in most real-world tasks, those search-based learning methods are non-convergent, i.e., they diverge during training. As large-margin classification models, those methods theoretically have some convergence properties based on strict separability conditions. However, those strict separability conditions are not satisfiable in most real-world tasks, as demonstrated in prior work (Sun et al., 2014). We will also show in the experiments that those search-based learning methods diverge dramatically as training goes on, such that the model accuracy gets worse and worse.

• Third, the existing search-based methods do not provide probabilistic information. The magnitude of the model weights grows dramatically as training goes on, and no reliable probabilistic information can be derived. We will also show the curves of the model weight magnitude in the experiments.


To address those issues, we propose a novel and “shockingly simple” solution, a search-based probabilistic online learning framework (SAPO), which can fix almost all of those drawbacks. The proposed method searches the top-n output candidates, derives probabilities based on the searched candidates, and conducts fast online learning by updating the model weights.

We show that the proposed method has a fast training speed comparable with perceptrons and MIRA, supports search-based optimization without any need to calculate gradients, is very easy to implement, achieves top accuracy that is even better than CRF, provides reliable probability information, and has theoretical guarantees of convergence towards the optimum given reasonable conditions. Although at the current stage our experiments focus on linear-chain tasks, the method and the theoretical results may apply to structured classification with more complex structures, for example tree and graph structures. Experiments on well-known tasks show that our method has better accuracy than CRF and almost as fast training speed as perceptron and MIRA. Results also show that SAPO can easily beat the state-of-the-art systems on those highly-competitive tasks, achieving record-breaking accuracies.

The contributions of this work are two-fold:


• On the methodology side, we propose a general-purpose search-based probabilistic online learning framework, SAPO, for structured classification. We show that SAPO can address a variety of issues of existing methods, with theoretical justifications. Compared with probabilistic gradient-based learning methods like CRF, the proposed method supports search-based learning, which avoids complex gradient calculation, and has extra advantages in accuracy and training speed. Compared with search-based learning methods like perceptron and MIRA, SAPO has much higher accuracy, with theoretical and empirical justifications of convergence, whereas perceptron and MIRA diverge in real-world tasks, as will be shown in the experiments.

• On the application side, for several important natural language processing and signal processing tasks, including part-of-speech tagging, biomedical entity recognition, phrase chunking, and activity recognition, our simple search-based learning method can easily beat the state-of-the-art systems on those highly-competitive tasks, achieving record-breaking accuracies with fast speed.

2. Proposed Method 

We first describe the proposed search-based probabilistic online learning algorithm SAPO, 
then we compare SAPO with existing methods. 
Algorithm 1 Search-based Probabilistic Online Learning Algorithm (SAPO)

 1: input: top-n search parameter $n$, regularization strength $\lambda$, learning rate $\gamma$
 2: repeat
 3:    Draw a sample $z = (x, y^*)$ at random from the training set $S$
 4:    Based on $w$, search the top-n outputs $Y_n = \{y_1, y_2, \dots, y_n\}$
 5:    For every $y_k \in Y_n$, compute the probability $P_k = P(y_k \mid x, w)$
 6:    For every $y_k \in Y_n$, update the weights by $w \leftarrow w - \gamma P_k F(x, y_k)$
 7:    For $y^*$, update the weights by $w \leftarrow w + \gamma F(x, y^*)$
 8:    Regularize the weights by $w \leftarrow w - \frac{\gamma\lambda}{|S|} \nabla R(w)$
 9: until convergence
10: return the learned weights $w^*$


2.1 Search-based Probabilistic Online Learning 

The proposed search-based probabilistic online learning algorithm SAPO has the following key schemes: top-n search (either exact or approximate), a scheme for calculating probabilities, a perceptron-style update of the weights, and a regularizer on the weights. We introduce the technical details of these schemes below, and then summarize the SAPO algorithm in Algorithm 1.

First, SAPO draws a training sample $z = (x, y^*)$ at random from the training set $S$, and searches for the top-n outputs:

$$Y_n = \{y_1, y_2, \dots, y_n\}$$


There are many methods to realize top-n search. One method uses the A* search algorithm (Hart et al., 1968). An A* search algorithm with a Viterbi heuristic function can be used to produce the top-n outputs one by one in an efficient manner. We use the backward Viterbi algorithm (Viterbi, 1967) to compute the admissible heuristic function for the forward-style A* search. In this way we can produce the top-n taggings efficiently.²
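The top-n search described above can be sketched for a linear-chain model as follows. This is a minimal illustration rather than the released SAPO code: the decomposition of the score into an emission matrix `emit` and a transition matrix `trans`, and all function names, are assumptions made for the example.

```python
import heapq
import numpy as np

def backward_viterbi(emit, trans):
    """h[t, y] = best score of completing a tagging after position t with tag y."""
    T, K = emit.shape
    h = np.zeros((T, K))
    for t in range(T - 2, -1, -1):
        # max over the next tag y2 of trans[y, y2] + emit[t+1, y2] + h[t+1, y2]
        h[t] = np.max(trans + (emit[t + 1] + h[t + 1])[None, :], axis=1)
    return h

def top_n_astar(emit, trans, n):
    """Forward A* search guided by the backward Viterbi heuristic.

    emit  : (T, K) per-position tag scores (assumed decomposition of w^T F)
    trans : (K, K) tag-to-tag transition scores
    Returns up to n (path, score) pairs, best first (up to floating-point ties).
    """
    T, K = emit.shape
    h = backward_viterbi(emit, trans)
    # Heap entries: (negated priority g + h, partial path, score so far g).
    heap = [(-(emit[0, y] + h[0, y]), (y,), emit[0, y]) for y in range(K)]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < n:
        _, path, g = heapq.heappop(heap)
        t = len(path)
        if t == T:  # a complete tagging: the next best output
            results.append((path, g))
            continue
        last = path[-1]
        for y in range(K):
            g2 = g + trans[last, y] + emit[t, y]
            heapq.heappush(heap, (-(g2 + h[t, y]), path + (y,), g2))
    return results

# Toy usage with random scores for a sentence of length 4 and 3 tags.
rng = np.random.default_rng(0)
emit_demo = rng.normal(size=(4, 3))
trans_demo = rng.normal(size=(3, 3))
for path, score in top_n_astar(emit_demo, trans_demo, 3):
    print(path, round(float(score), 4))
```

Because the heuristic here is the exact best completion score, complete taggings leave the priority queue in order of decreasing score, so the search can stop as soon as n of them have been produced.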

Then, for every $y_k \in Y_n$, SAPO computes the probability in a log-linear fashion:

$$P_k = P(y_k \mid x, w) \triangleq \frac{\exp[w^\top F(x, y_k)]}{\sum_{\forall y \in Y_n} \exp[w^\top F(x, y)]} \tag{1}$$

where $w$ is the vector of the model weights, $F(x, y_k)$ is the feature vector based on $x$ and $y_k$, and $Y_n$ is simply the top-n outputs defined before. With this definition, we can see that $\sum_{k=1}^{n} P_k = 1$. That is, we use top-n search results to estimate the probability distribution, which is typically defined as (e.g., in CRF):

$$P(y_k \mid x, w) \triangleq \frac{\exp[w^\top F(x, y_k)]}{\sum_{\forall y} \exp[w^\top F(x, y)]} \tag{2}$$

1. The SAPO code will be released at http://klcl.pku.edu.cn/member/sunxu/code.htm

2. Note that, although our search is “exact” top-n search, actually “exact” top-n search is not strictly 
required in the SAPO framework. In other words, we can replace exact A* search with non-exact beam 
search scheme for the SAPO algorithm. In experiments we tested both exact A* search and non-exact 
beam search with pruning (beam size is 50), and we find that there is almost no difference on the 
experimental results. 
As we can see, the only difference is the normalizer: we use top-n search results to estimate the normalizer. As $n$ grows, the probability estimation in (1) becomes more and more accurate with respect to the traditional probability in (2). On the theoretical side, we will show that this probability estimation can be arbitrarily close to the traditional probability with a proper $n$, and that the SAPO algorithm is guaranteed to converge arbitrarily close to the optimum weights $w^*$, given reasonable conditions. On the empirical side, we will show in experiments that the probability estimation is good enough for most real-world tasks even with n = 5 or n = 10.
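The effect of the normalizer can be illustrated numerically. The sketch below uses a made-up list of scores $w^\top F(x, y)$ for eight hypothetical candidate outputs (not data from the experiments) and compares the top-n estimate of (1) with the full normalization of (2); the function names are illustrative only.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    expd = np.exp(scores - np.max(scores))
    return expd / expd.sum()

def topn_probabilities(all_scores, n):
    """Indices of the n best outputs and their probabilities normalized over the top-n only."""
    order = np.argsort(all_scores)[::-1][:n]
    return order, softmax(all_scores[order])

# Hypothetical scores for 8 candidate outputs of one sample.
scores = np.array([4.0, 3.5, 1.0, 0.5, 0.0, -1.0, -2.0, -3.0])
full = softmax(scores)  # normalizer over all outputs, as in CRF

for n in (1, 2, 5, 8):
    order, est = topn_probabilities(scores, n)
    gap = np.abs(est - full[order]).max()
    print(f"n={n}: estimate for best output={est[0]:.4f}, "
          f"full probability={full[order[0]]:.4f}, max gap={gap:.4f}")
```

Since the top-n normalizer can only grow as more candidates are added, the estimate for each retained output decreases towards its full-normalization value and matches it once all outputs are included.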

After that, SAPO updates the weights in a perceptron fashion. For every $y_k \in Y_n$, the weights are updated as follows:

$$w \leftarrow w - \gamma P_k F(x, y_k) \tag{3}$$

As we can see, this is similar to the perceptron update, except for an additional learning rate $\gamma$ and a probabilistic scaler $P_k$. On the other hand, for the oracle tagging $y^*$, the weights are updated by

$$w \leftarrow w + \gamma F(x, y^*) \tag{4}$$

As we can see, this is also similar to the perceptron update. There is no need for a probability scaler here, because the probability of the oracle tagging is 1.

Finally, SAPO uses a weight regularizer with regularization strength $\lambda$. Following the regularization scheme of SGD, the regularization strength turns to $\lambda/|S|$ in the online learning setting (Bottou and LeCun, 2003; Niu et al., 2011; Sun et al., 2014). Also, the regularization should be scaled with the learning rate $\gamma$. Thus, by using a regularizer denoted as $R(w)$, the regularization step is as follows:

$$w \leftarrow w - \frac{\gamma\lambda}{|S|} \nabla R(w) \tag{5}$$

The regularizer $R(w)$ can be $L_2$, $L_1$, or other alternative regularization terms. For simplicity, in this work we use the most widely used $L_2$ regularizer (a Gaussian prior).

To sum up, the SAPO algorithm is summarized in Algorithm 1.
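One pass through the probability, update, and regularization steps can be sketched as a single function. This is an illustrative sketch under stated assumptions: candidate outputs are represented directly by their feature vectors, the regularizer is taken as $R(w) = \frac{1}{2}\|w\|^2$ (the text only says "$L_2$"), and $\nabla R$ is evaluated at the incoming weights. The function and argument names are hypothetical.

```python
import numpy as np

def sapo_update(w, feats_topn, feat_oracle, lr, lam, num_train):
    """One SAPO weight update for a single training sample.

    w           : (d,) current weights
    feats_topn  : (n, d) feature vectors F(x, y_k) of the top-n outputs
    feat_oracle : (d,) feature vector F(x, y*) of the oracle output
    lr, lam     : learning rate (gamma) and regularization strength (lambda)
    num_train   : training set size |S|
    """
    scores = feats_topn @ w
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                    # normalize over the top-n outputs only
    grad_reg = w                            # assumption: R(w) = 0.5 * ||w||^2
    w_new = w - lr * (probs @ feats_topn)   # probability-scaled penalty on each y_k
    w_new = w_new + lr * feat_oracle        # reward the oracle output
    w_new = w_new - lr * lam / num_train * grad_reg  # regularization step
    return w_new, probs

# Toy usage: two candidates with one-hot features, oracle equal to the first one.
w0 = np.zeros(2)
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
w1, p = sapo_update(w0, feats, feats[0], lr=1.0, lam=0.0, num_train=10)
print(w1, p)
```

When the oracle output is itself among the top-n candidates, both the penalty and the reward apply to it, so its net contribution is scaled by one minus its estimated probability.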


2.2 Comparison and Discussion

Among the existing structured classification methods, the methods most closely related to SAPO are structured perceptrons (Collins, 2002) and CRF (Lafferty et al., 2001).

If we compare SAPO with the structured perceptron (Collins, 2002) and CRF (Lafferty et al., 2001) with stochastic training, it is interesting to see that SAPO is like a “unification” of the perceptron and the stochastically trained CRF. If we neglect the learning rate and the regularizer term of SAPO, the perceptron algorithm (Collins, 2002) can be seen as an extreme case of SAPO with n = 1 (i.e., using top-1 search instead of top-n search). On the other hand, the stochastically trained CRF can be seen as another extreme case of SAPO with an exponentially big n that enumerates all possible output taggings (the only difference is that CRF uses dynamic programming instead of top-n search).
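To make the n = 1 case concrete, the following short derivation (a sketch that ignores the regularizer, as the paragraph above does) spells out how the two SAPO weight updates collapse into the structured perceptron update.

```latex
% n = 1: the top-1 output y_1 is the only candidate, so its top-n probability is 1.
\begin{aligned}
P_1 &= \frac{\exp[w^\top F(x, y_1)]}{\exp[w^\top F(x, y_1)]} = 1, \\
w &\leftarrow w - \gamma P_1 F(x, y_1) + \gamma F(x, y^*)
   = w + \gamma \big[ F(x, y^*) - F(x, y_1) \big].
\end{aligned}
% With \gamma = 1 this is the perceptron update; it vanishes when y_1 = y^*.
```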

In other words, the perceptron can be seen as SAPO with an extremely small n, and CRF can be seen as SAPO with an extremely big n. We argue that SAPO is more natural than both perceptrons and CRF: we should use a moderate value of n instead of an extremely small n (perceptrons) or an extremely huge n (CRF). As we will show in the experiments and the theoretical analysis, an extremely small n as in the perceptron leads to low accuracy and non-convergent training, and an extremely huge n as in CRF also leads to a loss of accuracy (due to the overfitting of probabilities) and high computational cost. In practice, we find it good enough to use n = 5 or n = 10 for real-world tasks.


The MIRA algorithm also has a variation, Nbest MIRA, which also uses top-n search (Crammer and Singer, 2003), and interestingly, it is also good enough to use n = 5 or n = 10 for Nbest MIRA (Crammer and Singer, 2003; McDonald et al., 2005; Chiang, 2012). Nevertheless, SAPO is substantially different from Nbest MIRA. The major difference is that SAPO estimates the probabilities of different outputs while Nbest MIRA does not. Nbest MIRA treats different outputs equally, without probability differences, and this is why CRF cannot be seen as a special case of Nbest MIRA. Even if Nbest MIRA uses an extremely huge n in top-n search, it is not equivalent to CRF and the difference is substantial. There are also other differences between SAPO and Nbest MIRA. For example, SAPO has the regularizer term and the learning rate, and has no need for the “minimum change” optimization criterion of MIRA during the weight update.

3. Theoretical Analysis 

Here we give theoretical analysis on the objective function, update term, convergence conditions, and convergence rate.

3.1 Objective Function and Update Term 

Here we analyze the equivalent objective function of SAPO and the update term of SAPO. The SAPO algorithm (Algorithm 1) is a search-based optimization algorithm, so there is no need to compute the gradient of an objective function, and no explicit objective function is used in the SAPO algorithm. Nevertheless, interestingly, we show that the SAPO algorithm is convergent and that it converges towards the optimum weights $w^*$ which maximize the following objective function:³

$$\operatorname*{maximize}_{w} \ \sum_{i=1}^{m} \log P(y_i^* \mid x_i, w) - \lambda R(w) \tag{6}$$

where $m$ is the number of training samples, i.e., $m = |S|$, and $R(w)$ is a weight regularization term for controlling overfitting. This objective function is similar to the objective function of CRF. Equivalently, for the convenience of convex-based analysis, we denote the objective function $f(w)$ as the negative form of (6):

$$f(w) = -\sum_{i=1}^{m} \log P(y_i^* \mid x_i, w) + \lambda R(w) \tag{7}$$

3. The subscript of y is overloaded here. For clarity throughout, y with subscript i and usually with the * mark refers to the tagging of the i'th indexed training sample (e.g., $y_i^*$), and y with subscript k refers to the k'th output of the search (e.g., $y_k$).
We show that the SAPO algorithm converges towards the optimum $w^*$ which minimizes the convex objective function $f(w)$:

$$w^* = \operatorname*{argmin}_{w} f(w) \tag{8}$$


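To clarify the theoretical analysis, we compare SAPO with the SGD (stochastic gradient descent) training scheme. Recall that the weight update has the following form in SGD (Bottou and LeCun, 2003; Niu et al., 2011):⁴

$$w \leftarrow w - \gamma \nabla f_z(w) \tag{9}$$

where $\nabla f_z(w)$ is the stochastic gradient of $f(w)$ based on the sample $z$, which has the following form if we use the CRF objective function:

$$\begin{aligned}
\nabla f_z(w) &= -\Big\{ F(x, y^*) - \sum_{\forall y} P(y \mid x, w) F(x, y) - \frac{\lambda}{|S|} \nabla R(w) \Big\} \\
&= -\Big\{ F(x, y^*) - \sum_{\forall y} \frac{\exp[w^\top F(x, y)]}{\sum_{\forall y'} \exp[w^\top F(x, y')]} F(x, y) - \frac{\lambda}{|S|} \nabla R(w) \Big\}
\end{aligned} \tag{10}$$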


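To make a comparison, we denote $s_z(w)$ as the (negative) SAPO update term for a sample $z$ such that

$$w \leftarrow w - \gamma s_z(w) \tag{11}$$

Then, according to the procedure of the SAPO algorithm, it is easy to check that $s_z(w)$ has the following form:

$$\begin{aligned}
s_z(w) &= -\Big\{ F(x, y^*) - \sum_{k=1}^{n} P_k F(x, y_k) - \frac{\lambda}{|S|} \nabla R(w) \Big\} \\
&= -\Big\{ F(x, y^*) - \sum_{\forall y \in Y_n} \frac{\exp[w^\top F(x, y)]}{\sum_{\forall y' \in Y_n} \exp[w^\top F(x, y')]} F(x, y) - \frac{\lambda}{|S|} \nabla R(w) \Big\}
\end{aligned} \tag{12}$$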


As we can see from (10) and (12), by increasing $n$, the SAPO update term $s_z(w)$ can be arbitrarily close to the stochastic gradient $\nabla f_z(w)$. More formally, define

$$\delta_z(w) = \nabla f_z(w) - s_z(w) \tag{13}$$

Then, for any $\epsilon > 0$, there is at least one corresponding $n$ such that

$$\delta_z(w) \le \epsilon \tag{14}$$

In other words, as $n$ increases, the approximation is expected to become more and more accurate, and finally $\delta_z(w) \le \epsilon$.
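The gap between the SAPO update term and the stochastic gradient can be checked numerically on a toy problem. In the sketch below the oracle term and the regularizer cancel in the difference, so only the two expected-feature terms remain. The random feature matrix, the choice to measure the gap by its Euclidean norm, and the function names are assumptions made for illustration.

```python
import numpy as np

def expected_features(feats, scores):
    """Probability-weighted sum of feature vectors, with a softmax over the given scores."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ feats

def update_gap(feats, w, n):
    """Euclidean norm of the difference between the stochastic gradient and the SAPO update term.

    feats : (N, d) feature vectors F(x, y) of all N possible outputs (toy enumeration)
    """
    scores = feats @ w
    top = np.argsort(scores)[::-1][:n]
    full_term = expected_features(feats, scores)             # sum over all outputs
    topn_term = expected_features(feats[top], scores[top])   # sum over the top-n outputs
    return np.linalg.norm(full_term - topn_term)

# Toy problem: 20 possible outputs with 5-dimensional random features.
rng = np.random.default_rng(0)
feats_demo = rng.normal(size=(20, 5))
w_demo = rng.normal(size=5)
for n in (1, 5, 10, 20):
    print(n, update_gap(feats_demo, w_demo, n))
```

With all outputs included (n equal to the number of outputs) the two terms coincide and the gap is zero up to rounding; for smaller n the size of the gap depends on how much probability mass lies outside the top-n outputs.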

4. In practice, SGD and SAPO can use a decayed learning rate or a fixed learning rate. Following (Niu et al., 2011; Sun, 2014), for the convenience of theoretical analysis, our analysis focuses on SGD and SAPO with a fixed learning rate.
3.2 Optimum, Convergence, and Convergence Rate

Recall that $f(w)$ is the structured classification objective function, and $w \in \mathcal{W}$ is the weight vector. By considering the time stamp $t$, the SAPO update (11) can be reformulated as follows:

$$w_{t+1} \leftarrow w_t - \gamma s_{z_t}(w_t) \tag{15}$$

To state our convergence analysis results, we need several assumptions following Nemirovski et al. (2009). We assume $f$ is strongly convex with modulus $c$, that is, $\forall w, w' \in \mathcal{W}$,

$$f(w') \ge f(w) + (w' - w)^\top \nabla f(w) + \frac{c}{2} \|w' - w\|^2 \tag{16}$$

where $\|\cdot\|$ means the 2-norm $\|\cdot\|_2$ by default in this work. When $f$ is strongly convex, there is a global optimum/minimizer $w^*$. We also assume Lipschitz continuous differentiability of $\nabla f$ with the constant $q$, that is, $\forall w, w' \in \mathcal{W}$,

$$\|\nabla f(w') - \nabla f(w)\| \le q \|w' - w\| \tag{17}$$

Also, let the norm of $s_z(w)$ be bounded by $\kappa \in \mathbb{R}^+$:

$$\|s_z(w)\| \le \kappa \tag{18}$$

Moreover, it is reasonable to assume

$$\gamma c < 1 \tag{19}$$

because even the ordinary gradient descent methods will diverge if $\gamma c > 1$ (Niu et al., 2011).
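As an illustration of when the strong-convexity assumption can hold, consider the following sketch. It assumes that the regularizer is scaled as $R(w) = \frac{1}{2}\|w\|^2$ and that $P$ is the fully normalized CRF probability; neither detail is fixed by the text above, so the resulting modulus $c = \lambda$ is only an example.

```latex
% The negative log-likelihood g(w) = -\sum_i \log P(y_i^* \mid x_i, w) of a
% log-linear model is convex, so for any w, w':
g(w') \ge g(w) + (w' - w)^\top \nabla g(w).
% The quadratic regularizer satisfies the identity
\frac{\lambda}{2}\|w'\|^2 = \frac{\lambda}{2}\|w\|^2 + \lambda (w' - w)^\top w
  + \frac{\lambda}{2}\|w' - w\|^2 .
% Adding the two lines, with f = g + \lambda R and \nabla f(w) = \nabla g(w) + \lambda w:
f(w') \ge f(w) + (w' - w)^\top \nabla f(w) + \frac{\lambda}{2}\|w' - w\|^2 .
```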

Based on the conditions, we show that SAPO converges towards the minimum w* of 
f{w) with an arbitrary-close distance, and the convergence rate is given as follows. 

Theorem 1 (Optimum, convergence, and rate) With the conditions (16), (17), (18), and (19), let $\epsilon > 0$ be a target degree of convergence. Let $\tau$ be an approximation-based bound from $s(w)$ to $\nabla f(w)$ such that

$$[\nabla f(w) - s(w)]^\top (w - w^*) \le \tau \tag{20}$$

where $w$ is a historical weight vector that is updated during SAPO training, and $s(w)$ is the expected $s_z(w)$ over $z$ such that $s(w) = \mathbb{E}_z[s_z(w)]$. Since $s(w)$ can be arbitrarily close to $\nabla f(w)$ by increasing $n$, SAPO can use the smallest $n$ as long as the following holds:

$$\tau \le \frac{c\epsilon}{2q} \tag{21}$$

Let $\gamma$ be a learning rate as

$$\gamma = \frac{c\epsilon - 2\tau q}{\beta q \kappa^2} \tag{22}$$

where we can set $\beta$ as any value as long as $\beta \ge 1$. Let $t$ be the smallest integer satisfying

$$t \ge \frac{\beta q \kappa^2 \log(q a_0/\epsilon)}{c(c\epsilon - 2\tau q)} \tag{23}$$

where $a_0$ is the initial distance such that $a_0 = \|w_0 - w^*\|^2$. Then, after $t$ updates of $w$, SAPO converges towards the optimum such that

$$\mathbb{E}[f(w_t) - f(w^*)] \le \epsilon \tag{24}$$
The proof is in Section 6.

This theorem shows that approximation-based learning like SAPO is also convergent towards the optimum of the objective function. Thus, we can approximate the true gradient by top-n search and still keep the convergence properties, without the need to calculate exact gradients as in the training of CRF.

More specifically, the theorem shows that SAPO is able to converge towards the optimum of the objective function with an arbitrarily close distance $\epsilon$, as long as the SAPO update term $s(w)$ is a “close-enough approximation” (i.e., satisfying (21)) of the true gradient $\nabla f(w)$. Since $s(w)$ can be arbitrarily close to $\nabla f(w)$ by increasing $n$, SAPO can use the smallest $n$ as long as the close-enough approximation (21) is achieved. In practice, we find that setting n to 5 or 10 already gives an empirically close-enough approximation in most real-world tasks. Moreover, the convergence rate is given in the theorem: SAPO is guaranteed to converge with $t$ updates, where $t$ is the smallest integer satisfying (23).
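To see how the quantities in Theorem 1 interact, the learning rate of (22) and the update count of (23) can be evaluated for given constants. The sketch below reads the two formulas as $\gamma = (c\epsilon - 2\tau q)/(\beta q \kappa^2)$ and $t \ge \beta q \kappa^2 \log(q a_0/\epsilon) / [c(c\epsilon - 2\tau q)]$; the function name and the example constants are hypothetical and not taken from the experiments.

```python
import math

def sapo_rate(c, q, kappa, eps, tau, beta, a0):
    """Learning rate and number of updates suggested by Theorem 1.

    c, q   : strong-convexity modulus and Lipschitz constant of the gradient
    kappa  : bound on the norm of the SAPO update term
    eps    : target degree of convergence
    tau    : approximation bound of the top-n update term
    beta   : free constant of the theorem
    a0     : initial squared distance ||w_0 - w*||^2
    """
    margin = c * eps - 2.0 * tau * q
    if margin <= 0:
        # A strictly positive margin is needed for a positive learning rate.
        raise ValueError("need tau < c * eps / (2 * q)")
    gamma = margin / (beta * q * kappa ** 2)
    bound = beta * q * kappa ** 2 * math.log(q * a0 / eps) / (c * margin)
    t = max(0, math.ceil(bound))
    return gamma, t

# Illustrative constants only.
gamma_demo, t_demo = sapo_rate(c=1.0, q=2.0, kappa=1.0, eps=0.1, tau=0.01, beta=2.0, a0=1.0)
print(gamma_demo, t_demo)
```

A looser approximation (larger $\tau$) shrinks the margin $c\epsilon - 2\tau q$, which lowers the usable learning rate and raises the number of updates, which is one way to read the trade-off in choosing n.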

This analysis also explains why the perceptron algorithm (Freund and Schapire, 1999; Collins, 2002) does not converge in most practical tasks. As we discussed before, the perceptron algorithm can essentially be treated as an extreme case of SAPO, which uses the extremely small n of 1. In most cases, the use of n = 1 does not satisfy the close-enough approximation condition (21). Thus in most cases the perceptron algorithm has a bad approximation of the true gradient and it diverges (as we will show in the experiments).


4. Related Work 


Many structured classification methods have been developed, including probabilistic gradient-based learning methods and search-based learning methods. The probabilistic gradient-based learning methods include conditional random fields (Lafferty et al., 2001) and a variety of extensions such as dynamic conditional random fields (Sutton et al., 2004), hidden conditional random fields (Quattoni et al., 2007), and latent-dynamic conditional random fields (Morency et al., 2007). The search-based learning methods include the margin infused relaxed algorithm (Crammer and Singer, 2003), structured perceptrons (Collins, 2002), and a variety of related work in this direction such as latent structured perceptrons (Sun et al., 2009, 2013b), confidence weighted linear classification (CW) (Dredze et al., 2008), and max-violation perceptrons (Yu et al., 2013). Most of the search-based learning methods are large-margin online learning methods. Other related work on structured classification also includes maximum margin Markov networks (Taskar et al., 2003) and structured support vector machines (Tsochantaridis et al., 2004).

For training structured classification models, especially probabilistic gradient-based ones, a widely used training method is stochastic gradient descent (SGD) (Bottou and LeCun, 2003; Zinkevich et al., 2010; Niu et al., 2011; Sun et al., 2012, 2014), which typically has a faster convergence rate compared with alternative batch training methods, such as limited-memory BFGS (L-BFGS) (Nocedal and Wright, 1999) and other quasi-Newton optimization methods. SGD training has theoretical guarantees of convergence to the optimum weights given a convex objective function (e.g., CRF) (Bottou and LeCun, 2003; Zinkevich et al., 2010; Niu et al., 2011). For the search-based learning methods such as perceptrons, MIRA, and their variants, the training scheme is usually quite simple and self-contained in the search-based learning algorithm/model (Collins, 2002; Crammer and Singer, 2003; McDonald et al., 2005).


As for dealing with overfitting, the probabilistic gradient-based learning methods typically use explicit regularization terms such as the widely used L2 regularizer. Other regularization schemes include the L1 regularizer (Andrew and Gao, 2007; Tsuruoka et al., 2009), the group Lasso regularization (Yuan and Lin, 2006; Martins et al., 2011), the structure regularization (Sun, 2014), and others (Quattoni et al., 2009). For the search-based learning methods like perceptrons and MIRA, the scheme to deal with overfitting is less formal than a regularizer, usually parameter averaging or voting (Collins, 2002; McDonald et al., 2005; Daume III, 2006; Chiang, 2012).


5. Experiments 

We describe the real-world tasks for experiments, the experimental settings, and the experimental results as follows.


5.1 Tasks 

We conduct experiments on natural language processing tasks and signal processing tasks with quite diversified characteristics. The natural language processing tasks include (1) part-of-speech tagging, (2) biomedical named entity recognition, and (3) phrase chunking. The signal processing task is (4) sensor-based human activity recognition. Tasks (1) to (3) use boolean features and task (4) adopts real-valued features. From tasks (1) to (4), the average length of samples (i.e., the number of tags per sample) is quite different: 23.9, 26.5, 46.6, and 67.9, respectively. The size of the tag set $|T|$ is also very diversified among tasks, with $|T|$ ranging from 5 to 45.

Part-of-Speech Tagging (POS-Tag): Part-of-Speech (POS) tagging is an important and highly competitive task in natural language processing. We use the standard benchmark dataset from prior work (Collins, 2002), which is derived from the Penn Treebank corpus and uses sections 0 to 18 of the Wall Street Journal (WSJ) for training (38,219 samples) and sections 22-24 for testing (5,462 samples). Following prior work (Tsuruoka et al., 2011), we use features based on unigrams and bigrams of neighboring words, and lexical patterns of the current word, with 393,741 raw features⁵ in total. Following prior work, the evaluation metric for this task is per-word accuracy.

Biomedical Named Entity Recognition (Bio-NER): This task is from the BioNLP-2004 shared task, which is for recognizing 5 kinds of biomedical named entities (DNA, RNA, etc.) on the MEDLINE biomedical text corpus. There are 17,484 training samples and 3,856 test samples. Following prior work (Tsuruoka et al., 2011), we use word pattern features and POS features, with 403,192 raw features in total. The evaluation metric is balanced F-score.

5. Raw features are those observation features based only on x, i.e., no combination with tag information.

Phrase Chunking (Chunking): In the phrase chunking task, the non-recursive cores of noun phrases, called base NPs, are identified. The phrase chunking data is extracted from the data of the CoNLL-2000 shallow-parsing shared task (Sang and Buchholz, 2000). The training set consists of 8,936 sentences, and the test set consists of 2,012 sentences. We use feature templates based on word n-grams and part-of-speech n-grams, with 264,818 raw features in total. Following prior studies, the evaluation metric for this task is balanced F-score.
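For reference, the balanced F-score used for the Bio-NER and Chunking tasks is the harmonic mean of precision and recall. The helper below is a generic sketch of that definition (the function name and the count-based interface are assumptions, not the official shared-task evaluation scripts).

```python
def balanced_f_score(num_correct, num_predicted, num_gold):
    """Balanced F-score (F1): harmonic mean of precision and recall.

    num_correct   : predicted entities/chunks that exactly match a gold one
    num_predicted : total predicted entities/chunks
    num_gold      : total gold entities/chunks
    """
    if num_correct == 0 or num_predicted == 0 or num_gold == 0:
        return 0.0
    precision = num_correct / num_predicted
    recall = num_correct / num_gold
    return 2.0 * precision * recall / (precision + recall)

# Example with made-up counts: 8 correct out of 10 predicted and 10 gold chunks.
print(balanced_f_score(8, 10, 10))
```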


Sensor-based Human Activity Recognition (Act-Recog): This is a task based on real-valued sensor signals, with the data extracted from the Bao04 activity recognition dataset (Sun et al., 2013a). This task aims to recognize human activities (walking, bicycling, etc.) by using 5 biaxial sensors to collect acceleration signals of individuals, with a sampling frequency of 76.25 Hz. Following prior work in activity recognition (Sun et al., 2013a), we use acceleration features, mean features, standard deviation, energy, and correlation features, with 1,228 raw features in total. There are 16,000 training samples and 4,000 test samples. Following prior work, the evaluation metric is accuracy.


5.2 Experimental Settings 


We compared the proposed SAPO algorithm with strong baselines in the existing literature, including both probabilistic gradient-based learning methods and search-based learning methods. For the probabilistic gradient-based learning methods, we choose the arguably most popular model, CRF (Lafferty et al., 2001), as the baseline. The CRF uses the widely used L2 regularization and is trained with the standard SGD training algorithm.

For search-based learning methods, we choose structured perceptrons (Perc) (Collins, 2002) and MIRA (Crammer and Singer, 2003), which are arguably the most popular search-based learning methods, as the baselines. In most cases, the averaged versions of perceptrons and MIRA work empirically better than the naive versions (Collins, 2002; McDonald et al., 2005; Daume III, 2006; Chiang, 2012). Thus we also compare SAPO with the averaged versions of perceptrons and MIRA. To differentiate the naive and averaged versions, we denote them as Perc-Naive, Perc-Avg, MIRA-Naive, and MIRA-Avg, respectively. Moreover, the MIRA method has Nbest versions (Crammer and Singer, 2003; McDonald et al., 2005), which adopt top-n search and update instead of Viterbi search and update. We also choose the Nbest versions of MIRA as additional baselines. We denote the Nbest MIRA with naive training as MIRA-Nbest-Naive, and the one with averaged training as MIRA-Nbest-Avg.

The regularization strength $\lambda$ of CRF is tuned among the values 0.1, 0.5, 1, 2, 5, and is determined on the development data provided by the standard dataset (POS-Tag) or simply via 4-fold cross validation on the training set (Bio-NER, Chunking, and Act-Recog). With this automatic tuning of the regularization strength, we set $\lambda$ to 2, 5, 1, and 5 for the POS-Tag, Bio-NER, Chunking, and Act-Recog tasks, respectively. To give no tuning advantage to SAPO, SAPO simply uses the same regularizer and the same learning rate as CRF. All the tuning is based on CRF, and there is no additional tuning for SAPO.
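The tuning procedure described above can be sketched as a generic k-fold selection loop. This is only an illustration of the idea: `train_fn` and `eval_fn` are hypothetical user-supplied callables (for example, CRF training and accuracy or F-score evaluation), and the random fold assignment is an assumption, not the exact protocol of the experiments.

```python
import numpy as np

def select_lambda(samples, candidates, train_fn, eval_fn, num_folds=4, seed=0):
    """Pick the regularization strength with the best mean validation score.

    train_fn(train_samples, lam) -> model   (hypothetical callable)
    eval_fn(model, held_out_samples) -> score, higher is better (hypothetical callable)
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(samples)), num_folds)
    mean_scores = {}
    for lam in candidates:
        scores = []
        for k in range(num_folds):
            held = set(folds[k].tolist())
            train = [s for i, s in enumerate(samples) if i not in held]
            valid = [samples[i] for i in folds[k]]
            scores.append(eval_fn(train_fn(train, lam), valid))
        mean_scores[lam] = float(np.mean(scores))
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

# Dummy usage: a stand-in "model" whose score peaks at lam = 2.
best_lam, table = select_lambda(
    list(range(20)), [0.1, 0.5, 1, 2, 5],
    train_fn=lambda train, lam: lam,
    eval_fn=lambda model, valid: -abs(model - 2),
)
print(best_lam)
```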

Also, the proposed SAPO algorithm uses the same top-n search scheme as the Nbest MIRA. As
shown in prior work (Crammer and Singer, 2003; McDonald et al., 2005; Chiang, 2012), it is
good enough to use n = 5 or n = 10 for Nbest MIRA. It is also good enough to use n = 5 or
n = 10 for the proposed SAPO algorithm. Thus, we set n = 5 for Nbest MIRA and SAPO for
fast speed. All methods use the same set of features. Experiments are performed on a
computer with an Intel(R) Xeon(R) 3.0GHz CPU.
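
To make the notion of top-n search concrete, the sketch below enumerates the n
highest-scoring label sequences of a linear-chain model with an n-best Viterbi dynamic
program. It is an illustration under assumptions, not the search procedure used in the
experiments: the function name `top_n_sequences` and the score layout (`emit`, `trans`)
are hypothetical.

```python
import heapq


def top_n_sequences(emit, trans, n=5):
    """Return the n best label sequences as (score, labels) pairs, best first.

    emit[t][y]  : score of label y at position t (hypothetical layout)
    trans[a][b] : score of moving from label a to label b (hypothetical layout)
    """
    num_labels = len(emit[0])
    # beams[y] holds the best n partial paths that end in label y.
    beams = [[(emit[0][y], (y,))] for y in range(num_labels)]
    for t in range(1, len(emit)):
        new_beams = []
        for y in range(num_labels):
            candidates = [
                (score + trans[path[-1]][y] + emit[t][y], path + (y,))
                for beam in beams
                for score, path in beam
            ]
            # Keeping n paths per end label is enough to recover the global top n.
            new_beams.append(heapq.nlargest(n, candidates))
        beams = new_beams
    finished = [item for beam in beams for item in beam]
    return heapq.nlargest(n, finished)
```

With n = 1 this reduces to ordinary Viterbi decoding; with n = 5 it returns the candidate
set size used for Nbest MIRA and SAPO in this section.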

[Figure 1 panels: POS-Tag Accuracy (1)-(3), POS-Tag Weight (1)-(3), POS-Tag Objective, and
POS-Tag Time Cost. Axes: Number of Iteration; Method (SAPO, CRF, MIRA-NA, MIRA-NN, MIRA-A,
MIRA-N, Perc-A, Perc-N).]

Figure 1: Results on the POS-Tag task. 

5.3 Experimental Results 

The experimental results in terms of accuracy/F-score are shown in Figure 1, Figure 2,
Figure 3, and Figure 4, respectively. As we can see, although the tasks have diverse
feature types (boolean or real-valued) and different characteristics, the results are quite
consistent: the proposed SAPO algorithm has the best accuracies/F-scores on all of the
four tasks compared with the existing baselines.

[Figure 2 panels: Bio-NER F1 (1)-(3).]

Figure 2: Results on the Bio-NER task.

First, we compare SAPO with CRF. It is impressive that the proposed SAPO algorithm even
has better accuracy than CRF, which is arguably one of the most accurate models for
structured classification. Note that the CRF model and other baselines are already fully
optimized, which can be confirmed by comparing those results with the state-of-the-art
reports on those four tasks (to be shown in a table later). As for the superiority, the
reason is that the probability is distributed over the top-n outputs in SAPO, which is a
more “regularized” distribution than the probability distribution over all possible
outputs (an exponential number). In this sense, SAPO is “regularizing” the exponential
probability distribution to a simpler top-n probability distribution. This can be seen as
a probability-based regularizer with the regularization strength controlled by n, and
interestingly the experimental results suggest that this type of regularization can indeed
improve the accuracy/F-score.
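
The effect of restricting the distribution to the top-n outputs can be illustrated with a
small numerical sketch. It assumes that probabilities are obtained by exponentiating and
normalizing candidate scores; the function names and the toy scores are hypothetical and
serve only to show how the top-n distribution concentrates the probability mass.

```python
import math


def softmax(scores):
    """Normalize exponentiated scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_distribution(scores, n):
    """Distribute probability over the n highest-scoring candidates only.

    Returns a dict mapping candidate index -> probability.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    probs = softmax([scores[i] for i in top])
    return dict(zip(top, probs))


# Toy candidate scores (hypothetical); a real task has exponentially many candidates.
scores = [2.0, 1.5, 0.3, -0.2, -1.0, -1.1]
full = softmax(scores)                  # distribution over all candidates
top3 = top_n_distribution(scores, 3)    # distribution over the top-3 candidates
```

Each retained candidate receives at least as much probability as under the full
distribution, and the relative ratios among retained candidates are unchanged; a smaller n
gives a more concentrated distribution.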

We observe that SAPO is better than CRF on all of the four tasks, and we will show
that in many cases the differences are statistically significant. Also, we can see that SAPO
is several times faster than CRF in terms of training time. At convergence, SAPO
achieves a similar or even better loss function value than CRF.

Second, we compare SAPO with search-based learning methods, including the naive/averaged
versions of Perc, MIRA, and Nbest MIRA. As we can see, the advantages of SAPO over
search-based learning methods are even more significant than those over CRF.

[Figure 3 panels: Chunking F1 (1)-(3).]

Figure 3: Results on the Chunking task.


We also conduct significance tests based on the t-test (see footnote 6). For the POS-Tag
task, the significance test suggests that the superiorities of SAPO over all of the
baselines except CRF are highly statistically significant, with at least p < 0.01. For the
Bio-NER task, the significance test suggests that the superiorities of SAPO over all of the
baselines are significant, with at least p < 0.05. For the Act-Recog task, the
superiorities of SAPO over all of the baselines are very significant, with at least
p < 0.01.

Our method actually outperforms the state-of-the-art records on those competitive natural
language processing tasks. Those datasets are standard benchmark datasets, so the results
can be directly compared with existing work. The POS-Tagging task is a highly competitive
task, with many methods proposed, and the best report (without using extra resources)
until now is achieved by using a bidirectional learning model in Shen et al. (2007), with
an accuracy of 97.33%. With 97.38%, our simple method achieves better accuracy than all of
those state-of-the-art systems. Our SAPO method also matches or exceeds the
state-of-the-art methods on the Bio-NER and Chunking tasks, which are also very competitive
tasks in the natural language processing community.

6. For the tasks measured by F-score, the t-test is approximated by using accuracy to approximate F-score.

[Figure 4 panel: Act-Recog Accuracy (1).]

Figure 4: Results on the Act-Recog task.

The first row of those figures shows the training curves based on the number of training
iterations. As we can see, SAPO and CRF converge as the training goes on, while Perc,
MIRA, and Nbest MIRA diverge as the training goes on.

The second row of those figures shows the w-complexity based on the number of training
iterations. The w-complexity is the averaged (absolute) value of the weights. As we can
see, SAPO and CRF have convergent and very small weight complexity as the training
goes on, while Perc, MIRA, and Nbest MIRA have a linear or even super-linear explosion of
weight complexity as the training goes on. Large weight complexity is typically a bad sign
for controlling generalization risk.
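
As a minimal sketch of the quantity plotted in the second row, the weight complexity can
be computed as the mean absolute weight value. The function name `weight_complexity` is a
hypothetical label chosen for this illustration.

```python
def weight_complexity(weights):
    """Return the average absolute value of the weights."""
    if not weights:
        raise ValueError("weights must be non-empty")
    return sum(abs(w) for w in weights) / len(weights)
```

Recording this value once per training iteration yields curves of the kind described
above.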
The left side of the third row of those figures shows the loss function of SAPO and
CRF based on the number of training iterations. As we can see, SAPO converges as well as
or even better than CRF in terms of the loss function, as the training goes on. This
confirms our theoretical analysis on the convergence of SAPO. The right side of the third
row of those figures shows the training time per iteration in seconds. As we can see, SAPO
has a low computational cost, especially compared with CRF and Nbest MIRA.

To summarize, the experimental results demonstrate that SAPO has better accuracy
than probabilistic gradient-based methods like CRF and at the same time has fast training
speed like perceptrons and MIRA. Also, SAPO converges towards the optimum with
controllable weight complexity as the training goes on. We emphasize that there are other
important advantages of SAPO that are not shown in those experiments: SAPO supports
search-based learning such that gradient information is not needed, gives probability
information, and is very easy to implement.

6. Proof 

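Here we give the proof of Theorem 1. First, the recursion formula is derived. Then, the
bounds are derived.

6.1 Recursion Formula

By subtracting $w^*$ from both sides and taking norms for (15), we have

$$
\begin{aligned}
\|w_{t+1} - w^*\|^2 &= \|w_t - \gamma s_{z_t}(w_t) - w^*\|^2 \\
&= \|w_t - w^*\|^2 - 2\gamma (w_t - w^*)^T s_{z_t}(w_t) + \gamma^2 \|s_{z_t}(w_t)\|^2
\end{aligned}
\tag{25}
$$

Taking expectations and letting $a_t = \mathbb{E}\|w_t - w^*\|^2$, we have

$$
\begin{aligned}
a_{t+1} &= a_t - 2\gamma \mathbb{E}[(w_t - w^*)^T s_{z_t}(w_t)] + \gamma^2 \mathbb{E}[\|s_{z_t}(w_t)\|^2] \\
&\quad \text{(based on (18))} \\
&\le a_t - 2\gamma \mathbb{E}[(w_t - w^*)^T s_{z_t}(w_t)] + \gamma^2 \kappa^2 \\
&\quad \text{(since the random draw of } z_t \text{ is independent of } w_t\text{)} \\
&= a_t - 2\gamma \mathbb{E}[(w_t - w^*)^T \mathbb{E}_{z_t}(s_{z_t}(w_t))] + \gamma^2 \kappa^2 \\
&= a_t - 2\gamma \mathbb{E}[(w_t - w^*)^T s(w_t)] + \gamma^2 \kappa^2
\end{aligned}
\tag{26}
$$

We define

$$
\delta(w) = \nabla f(w) - s(w)
\tag{27}
$$

and insert it into (16), it goes to

$$
\begin{aligned}
f(w') &\ge f(w) + (w' - w)^T [s(w) + \delta(w)] + \frac{c}{2}\|w' - w\|^2 \\
&= f(w) + (w' - w)^T s(w) + \frac{c}{2}\|w' - w\|^2 + (w' - w)^T \delta(w)
\end{aligned}
\tag{28}
$$

By setting $w' = w^*$, we further have

$$
\begin{aligned}
(w - w^*)^T s(w) &\ge f(w) - f(w^*) + \frac{c}{2}\|w - w^*\|^2 - (w - w^*)^T \delta(w) \\
&\ge \frac{c}{2}\|w - w^*\|^2 - (w - w^*)^T \delta(w)
\end{aligned}
\tag{29}
$$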
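Combining (26) and (29), we have

$$
\begin{aligned}
a_{t+1} &\le a_t - 2\gamma \mathbb{E}\left[\frac{c}{2}\|w_t - w^*\|^2 - (w_t - w^*)^T \delta(w_t)\right] + \gamma^2 \kappa^2 \\
&= (1 - c\gamma) a_t + 2\gamma \mathbb{E}[(w_t - w^*)^T \delta(w_t)] + \gamma^2 \kappa^2
\end{aligned}
\tag{30}
$$

Considering the bound $\tau$ on the $\delta$ term assumed in the theorem, it goes to

$$
a_{t+1} \le (1 - c\gamma) a_t + 2\gamma\tau + \gamma^2 \kappa^2
\tag{31}
$$

We can find a steady state $a_\infty$ as follows

$$
a_\infty = (1 - c\gamma) a_\infty + 2\gamma\tau + \gamma^2 \kappa^2
\tag{32}
$$

which gives

$$
a_\infty = \frac{2\tau + \gamma \kappa^2}{c}
\tag{33}
$$

Defining the function $A(x) = (1 - c\gamma)x + 2\gamma\tau + \gamma^2\kappa^2$, based on (31) we have

$$
\begin{aligned}
a_{t+1} &\le A(a_t) \\
&\quad \text{(Taylor expansion of } A(\cdot) \text{ based on } a_\infty \text{, with } \nabla^2 A(\cdot) \text{ being 0)} \\
&= A(a_\infty) + \nabla A(a_\infty)(a_t - a_\infty) \\
&= A(a_\infty) + (1 - c\gamma)(a_t - a_\infty) \\
&= a_\infty + (1 - c\gamma)(a_t - a_\infty)
\end{aligned}
\tag{34}
$$

Thus, we have

$$
a_{t+1} - a_\infty \le (1 - c\gamma)(a_t - a_\infty)
\tag{35}
$$

Unwrapping (35) goes to

$$
a_t \le (1 - c\gamma)^t (a_0 - a_\infty) + a_\infty
\tag{36}
$$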
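
The recursion (31), its steady state (33), and the unwrapped form (36) can be checked
numerically. The sketch below iterates the equality case of (31); the function names and
the constant values are hypothetical and chosen only for illustration.

```python
def steady_state(c, gamma, tau, kappa):
    """Steady state of the recursion, as in (33)."""
    return (2.0 * tau + gamma * kappa ** 2) / c


def bound_recursion(a0, c, gamma, tau, kappa, steps):
    """Iterate a_{t+1} = (1 - c*gamma) * a_t + 2*gamma*tau + gamma^2 * kappa^2."""
    a = a0
    history = [a]
    for _ in range(steps):
        a = (1.0 - c * gamma) * a + 2.0 * gamma * tau + (gamma * kappa) ** 2
        history.append(a)
    return history


def closed_form(a0, c, gamma, tau, kappa, t):
    """Right-hand side of (36): geometric decay towards the steady state."""
    a_inf = steady_state(c, gamma, tau, kappa)
    return (1.0 - c * gamma) ** t * (a0 - a_inf) + a_inf


# Hypothetical constants with c * gamma < 1.
history = bound_recursion(a0=10.0, c=0.5, gamma=0.1, tau=0.01, kappa=2.0, steps=500)
```

In the equality case the iterates coincide with the closed form (36) and approach
$a_\infty$ geometrically at rate $1 - c\gamma$.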


6.2 Bounds 

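Since $\nabla f(w)$ is Lipschitz according to (17), we have

$$
f(w) \le f(w') + \nabla f(w')^T (w - w') + \frac{q}{2}\|w - w'\|^2
$$

Setting $w' = w^*$, and noting that $\nabla f(w^*) = 0$ at the optimum, it goes to
$f(w) - f(w^*) \le \frac{q}{2}\|w - w^*\|^2$, such that

$$
\mathbb{E}[f(w_t) - f(w^*)] \le \frac{q}{2}\mathbb{E}\|w_t - w^*\|^2 = \frac{q}{2} a_t
$$

In order to have

$$
\mathbb{E}[f(w_t) - f(w^*)] \le \epsilon
\tag{37}
$$

it suffices that $\frac{q}{2} a_t \le \epsilon$, that is

$$
a_t \le \frac{2\epsilon}{q}
\tag{38}
$$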
Combining (36) and (38), it is required that

$$
(1 - c\gamma)^t (a_0 - a_\infty) + a_\infty \le \frac{2\epsilon}{q}
\tag{39}
$$

To meet this requirement, it is sufficient to set the learning rate $\gamma$ such that both
terms on the left side are less than $\frac{\epsilon}{q}$. For the requirement of the second
term $a_\infty \le \frac{\epsilon}{q}$, recalling (33), it goes to

$$
\gamma \le \frac{c\epsilon - 2\tau q}{q\kappa^2}
$$

Thus, introducing a real value $\beta \ge 1$, we can set $\gamma$ as

$$
\gamma = \frac{c\epsilon - 2\tau q}{\beta q\kappa^2}
\tag{40}
$$

Note that, to make this formula meaningful, it is required that

$$
c\epsilon - 2\tau q > 0
$$

Thus, it is required that

$$
\tau < \frac{c\epsilon}{2q}
$$

which is solved by the condition of (21).

On the other hand, we analyze the requirement of the first term that

$$
(1 - c\gamma)^t (a_0 - a_\infty) \le \frac{\epsilon}{q}
\tag{41}
$$

Since $a_0 - a_\infty \le a_0$, it holds by requiring

$$
(1 - c\gamma)^t a_0 \le \frac{\epsilon}{q}
\tag{42}
$$

which goes to

$$
t \ge \frac{\log\frac{\epsilon}{q a_0}}{\log(1 - c\gamma)}
\tag{43}
$$

Since $\log(1 - c\gamma) \le -c\gamma$ given (19), and that $\log\frac{\epsilon}{q a_0}$ is a
negative term, we have

$$
\frac{\log\frac{\epsilon}{q a_0}}{\log(1 - c\gamma)} \le \frac{\log\frac{\epsilon}{q a_0}}{-c\gamma}
$$

Thus, (43) holds by requiring

$$
t \ge \frac{\log\frac{\epsilon}{q a_0}}{-c\gamma} = \frac{\log(q a_0/\epsilon)}{c\gamma}
\tag{44}
$$

Combining (40) and (44), it goes to

$$
t \ge \frac{\beta q\kappa^2 \log(q a_0/\epsilon)}{c(c\epsilon - 2\tau q)}
$$

which completes the proof.
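
As a worked illustration of the bounds, the sketch below evaluates the learning rate (40)
and the resulting iteration bound for given constants. The function names and the example
constants used in the note after the block are hypothetical; the code simply evaluates the
formulas derived above.

```python
import math


def learning_rate(c, q, kappa, tau, eps, beta=1.0):
    """Learning rate from (40); requires c*eps - 2*tau*q > 0 and beta >= 1."""
    margin = c * eps - 2.0 * tau * q
    if margin <= 0:
        raise ValueError("need tau < c * eps / (2 * q)")
    if beta < 1:
        raise ValueError("need beta >= 1")
    return margin / (beta * q * kappa ** 2)


def iterations_needed(c, q, kappa, tau, eps, a0, beta=1.0):
    """Final iteration bound of the proof (assumes eps < q * a0)."""
    margin = c * eps - 2.0 * tau * q
    if margin <= 0:
        raise ValueError("need tau < c * eps / (2 * q)")
    return beta * q * kappa ** 2 * math.log(q * a0 / eps) / (c * margin)
```

For example, with the hypothetical constants c = 1, q = 2, κ = 1, τ = 0.001, ε = 0.1,
a_0 = 5, and β = 2, the learning rate is 0.024, and running the bound (36) for the
returned number of iterations gives a value below 2ε/q, as required by (38).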

7. Conclusions and Future Work

The existing structured classification methods are problematic. The existing probabilistic
gradient-based methods such as CRF have slow training and do not support search-based
optimization. The existing search-based learning methods such as perceptrons and MIRA
have relatively low accuracy and are non-convergent in most real-world tasks. We
propose a novel and “shockingly easy” solution, a search-based probabilistic online
learning framework SAPO, to address all of those issues. SAPO has fast training, supports
search-based optimization, is very easy to implement, and offers top accuracy, probability
information, and theoretical guarantees of convergence.

Although we currently focus more on sequence structures, the method and the theories
can apply to structured classification with more complex structures, for example trees
and graphs. Experiments on well-known benchmark tasks demonstrate that SAPO has
better accuracy than CRF and roughly as fast training speed as perceptrons and MIRA.
Results also show that SAPO can easily beat the state-of-the-art systems on those
highly-competitive tasks, achieving record-breaking accuracies.

In the current implementation, our top-n search uses a simple A* search algorithm with
Viterbi heuristics. This top-n search algorithm is not fully optimized for speed, and there
are several other top-n search algorithms that are possibly faster. In the future we can
optimize the top-n search algorithm, which we believe can further improve the training
speed of SAPO. Moreover, SAPO is a general purpose algorithm for structured classification
with arbitrary structures. In the future we can apply SAPO to structured classification
with more complex structures, e.g., syntactic parsing and statistical machine translation.